331 Final Project

Author

Hallie Christopherson, Meyli Jaeger, Tyler Luby Howard, and Spruha Nayak

PC3: Joining Data

Dataset Sources

Total Health Spending per person (International $)

  • Source: https://www.who.int/gho/en/

  • Shows the average health expenditure per person, expressed in international dollars using PPP (purchasing power parity)

Sugar per person (g per day)

  • Source: https://www.fao.org/faostat/en/#home

  • Quantity of sugars and sweeteners consumed as food (g per person per day). The 2004 data are a rough extrapolation.

Preprocessing Steps

Joined Dataset Sample

Six Country-Year Combinations with the Lowest Ratio of Sugar Consumption to Healthcare Spending
country     year  spending (Intl $/person)  sugar (g/person/day)
Norway      2010  8090                      105
Norway      2008  8070                      105
Norway      2009  7530                      103
Norway      2007  7310                      103
Norway      2006  6250                      101
Luxembourg  2010  8180                      137

Write-up

For this analysis, we are exploring two variables integral to understanding the evolution of global nutritional health and healthcare spending. The first is the quantity of sugars and sweeteners consumed per person (measured in g per day). These data originate from the United Nations Food and Agriculture Organization’s FAOSTAT database and were compiled by Gapminder. Data for 2004 were missing, so the 2004 values are a rough extrapolation calculated by Gapminder.

The second variable is the total health spending per person as measured in international dollars, represented using purchasing power parity (PPP), a currency conversion rate that equalizes different currencies by removing differences in price levels amongst countries. This data comes from the World Health Organization’s Global Health Expenditure Database (GHED).

We hypothesize that these two variables are strongly related and that increases in sugar consumption result in rising health spending per person worldwide. An article from UC Berkeley Public Health by Berthold (2023) supports this, reporting that a local soda tax in Oakland, CA was followed by a 26.8% drop in purchases of sugar-sweetened beverages. We are exploring whether preventing the diseases associated with these sugary beverages (diabetes, heart disease, stroke, gum disease) reduces health care costs, and whether this pattern holds at a global scale.

To prepare the data for analysis, we combined two datasets: one reporting average sugar consumption per person per day and the other detailing health care spending per person, both measured by country and year. Each dataset was originally in wide format, with one column per year; we reshaped them so that each row represents a single country-year observation.
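The reshaping step can be illustrated with a minimal sketch. It is written in plain Python rather than the R used for the project, and the table contents, column names, and the `wide_to_long` helper are made-up assumptions for illustration, not the project's data or code.

```python
def wide_to_long(rows, id_col, value_name):
    """Turn one-row-per-country records into one-row-per-country-year records."""
    long_rows = []
    for row in rows:
        for col, value in row.items():
            if col == id_col:
                continue  # the identifier column is carried along, not pivoted
            long_rows.append(
                {id_col: row[id_col], "year": int(col), value_name: value}
            )
    return long_rows


# Hypothetical wide table: one row per country, one column per year.
wide_sugar = [
    {"country": "A", "2009": 100.0, "2010": 102.0},
    {"country": "B", "2009": 55.0, "2010": None},
]

long_sugar = wide_to_long(wide_sugar, "country", "sugar")
```

Each wide row with two year columns becomes two long rows, which is why the row count grows sharply after reshaping.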

After reshaping, we checked for duplicate country-year combinations and found none. Each row corresponded to a unique country-year pair, confirming structural integrity prior to joining. Before merging, we also ensured consistency in country names and removed any observations lacking year or country information.
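A check of this kind might look like the following sketch. The rows, column names, and both helper functions are hypothetical and are shown only to make the two checks concrete (dropping rows without a key, then counting repeated keys).

```python
from collections import Counter


def drop_missing_keys(rows, key_cols=("country", "year")):
    """Keep only rows that have a value for every key column."""
    return [row for row in rows if all(row.get(c) is not None for c in key_cols)]


def duplicate_keys(rows, key_cols=("country", "year")):
    """Return the keys that appear in more than one row."""
    counts = Counter(tuple(row[c] for c in key_cols) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


# Made-up rows: one duplicated country-year and one row without a country.
rows = [
    {"country": "A", "year": 2009, "sugar": 100.0},
    {"country": "A", "year": 2010, "sugar": 102.0},
    {"country": "A", "year": 2010, "sugar": 101.0},
    {"country": None, "year": 2010, "sugar": 60.0},
]

clean = drop_missing_keys(rows)
dupes = duplicate_keys(clean)
```

An empty `dupes` list corresponds to the situation described above, where every row is a unique country-year pair.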

The merged dataset includes observations from 1961 to 2018, though not all countries report data for every year. In their original wide format, the two datasets had 179 and 190 rows, respectively; reshaping to one row per country-year expanded each of them considerably. The final cleaned dataset contains 2,585 rows, after excluding rows in which both the sugar and spending values are missing.

We validated our join by identifying unmatched country-year pairs using anti_join(), finding 7,742 country-year combinations in the sugar dataset with no match in the spending dataset. This highlights that there are large data gaps in health spending records.
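For readers unfamiliar with anti-joins, the idea can be sketched in plain Python. This mirrors what dplyr's anti_join() returns (rows of the first table whose key has no match in the second), but the function below and the example rows are illustrative assumptions, not the project's code or data.

```python
def anti_join(left, right, key_cols=("country", "year")):
    """Rows of `left` whose key does not appear in `right`."""
    right_keys = {tuple(row[c] for c in key_cols) for row in right}
    return [
        row for row in left
        if tuple(row[c] for c in key_cols) not in right_keys
    ]


# Made-up long-format tables keyed by country and year.
sugar = [
    {"country": "A", "year": 1990, "sugar": 90.0},
    {"country": "A", "year": 2000, "sugar": 95.0},
    {"country": "B", "year": 2000, "sugar": 50.0},
]
spending = [
    {"country": "A", "year": 2000, "spending": 1200.0},
    {"country": "B", "year": 2000, "spending": 300.0},
]

unmatched = anti_join(sugar, spending)
```

Here the only unmatched row is the 1990 sugar observation, which has no spending record for the same country and year.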

PC4: Joining Data

Data Visualization

Static Visualization: Sugar Consumption vs. Health Spending by Country

The following scatterplot visualizes the relationship between sugar consumption and health care spending. Each point represents a single country’s average values across all recorded years. The estimated linear trend, shown in blue, highlights the direction of the association.

The upward-sloping regression line suggests that, in general, countries with higher sugar consumption tend to spend more on health care per person. However, the wide spread of the points around the line indicates that other factors beyond sugar intake likely influence health spending as well.

Animated Visualization: Sugar Consumption vs. Health Spending Over Time

To examine how this relationship evolves over time, we created an animated plot showing annual trends from 1995 to 2010. Each frame displays one point per country for a single year, plotting sugar consumption (g/person/day) against health spending (Intl $/person), with a fitted line showing the linear trend for that year.

While the overall association remains positive, the strength and spread of this relationship fluctuate. For example, from the late 1990s onward, some countries exhibit rapid increases in spending despite stable sugar levels, suggesting the influence of confounding factors such as economic development or healthcare policy.

Linear Model

To further examine the relationship between sugar consumption and health spending, we summarized our data set by averaging sugar consumption and health spending across all years for each country. We then fit a linear regression model with average sugar consumption as the predictor and average health spending as the response. Averaging smooths out year-to-year fluctuations, which gives a more stable trend line than the year-specific fits in the animation, while still drawing on an adequately long period of data collection.
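The two steps described here (averaging by country, then fitting a least-squares line) can be sketched as follows. The rows are made-up values chosen so the result is easy to check by hand, and `country_means` and `fit_line` are hypothetical helpers rather than the functions used in the project.

```python
from collections import defaultdict
from statistics import mean


def country_means(rows):
    """Average sugar and spending across years for each country."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["country"]].append(row)
    return {
        country: (
            mean(r["sugar"] for r in group),
            mean(r["spending"] for r in group),
        )
        for country, group in grouped.items()
    }


def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Made-up country-year rows (spending is exactly 10 times sugar).
rows = [
    {"country": "A", "year": 2009, "sugar": 10.0, "spending": 100.0},
    {"country": "A", "year": 2010, "sugar": 20.0, "spending": 200.0},
    {"country": "B", "year": 2009, "sugar": 30.0, "spending": 300.0},
    {"country": "B", "year": 2010, "sugar": 30.0, "spending": 300.0},
    {"country": "C", "year": 2010, "sugar": 45.0, "spending": 450.0},
]

averages = country_means(rows)
xs = [sugar for sugar, _ in averages.values()]
ys = [spending for _, spending in averages.values()]
intercept, slope = fit_line(xs, ys)
```

Because each country contributes a single averaged point, countries with many years of data carry the same weight in the fit as countries with only a few.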

We fit the following linear regression model:

\[ \hat{y} = -502.829 + 14.035x \]

where \(\hat{y}\) is average healthcare spending per person per year, and \(x\) is average daily sugar intake (g/person/day).

Linear Regression Coefficients
Term         Estimate  Std_Error  t_value  p_value
(Intercept)  -502.829  177.412    -2.834   0.00517
avg_sugar    14.035    1.914      7.334    <0.00001

The intercept implies that a country with zero sugar consumption would have an estimated health spending of -$503, which is not interpretable on its own. The minimum observed average sugar consumption across all countries was approximately 7.618 g/day, so the intercept of about -503 is an extrapolation outside the data range. The slope indicates that each additional gram of sugar consumed per person per day is associated with approximately $14.04 more in a country’s average annual health spending per person. Because the model has a single predictor, this estimate does not hold other factors constant.
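As a quick worked example, the reported coefficients can be plugged into the fitted equation. This sketch uses only the estimates and the minimum sugar value stated above; the function name and the choice of 100 g/day as an example input are assumptions for illustration.

```python
INTERCEPT = -502.829  # reported intercept (Intl $ per person)
SLOPE = 14.035        # reported slope (Intl $ per additional g/person/day)


def predicted_spending(avg_sugar):
    """Fitted average health spending for a given average sugar intake."""
    return INTERCEPT + SLOPE * avg_sugar


at_minimum = predicted_spending(7.618)   # lowest observed average intake
at_100_grams = predicted_spending(100.0)  # arbitrary example value
break_even = -INTERCEPT / SLOPE           # intake at which the fit crosses $0
```

Under these estimates the fitted line only becomes positive at roughly 36 g/day, so the prediction at the minimum observed intake (about -$396) is still negative. The line is therefore a poor description of the lowest-consumption countries, not just of the hypothetical zero-sugar case.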

Model Fit

Decomposition of Model Variance

The following table breaks down the total variation in the outcome variable into explained and unexplained components:

Analyzing Model Variance
Model Fit Variable           Value
variance in response values  1410958.241
variance in fitted values    350040.643
variance in residuals        1060917.598
r-squared                    0.248

The R² value of 0.248 indicates that sugar consumption explains 24.8% of the variability in health spending across countries. This suggests a moderate association, but also implies that 75.2% of spending variation is likely driven by a multitude of other factors, such as economic development or healthcare policy.
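The arithmetic behind this table can be verified directly from the reported values. The variable names below are illustrative; the numbers are taken from the table.

```python
var_response = 1410958.241   # variance in response values
var_fitted = 350040.643      # variance in fitted values
var_residuals = 1060917.598  # variance in residuals

# For a least-squares fit with an intercept, the two components add up
# to the total variance of the response.
total_from_parts = var_fitted + var_residuals

r_squared = var_fitted / var_response       # explained share
unexplained = var_residuals / var_response  # unexplained share
```

The explained share rounds to 0.248 and the unexplained share to 0.752, matching the percentages quoted in the text.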

Cross Validation

Implement k-fold cross validation

 [1]  0.08901934  2.08033802  1.50571501  0.22974073  0.03571885  0.57156824
 [7]  0.26344297  0.18883690  0.95914753  0.24124309 15.11158664  0.50671016
[13]  0.30705568  0.18275002  0.11039892
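One way such fold-level values could be produced is sketched below in plain Python. It is not the project's R code: the data are simulated, and `fit_line` and `k_fold_r2` are hypothetical helpers. For each fold it reports two quantities: the ratio of the variance of the predictions to the variance of the held-out observations, which can exceed 1, and an out-of-sample R² defined as 1 - SSE/SST, which cannot exceed 1 but can be negative.

```python
import random
from statistics import mean, variance


def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def k_fold_r2(xs, ys, k, seed=0):
    """Return one (variance_ratio, out_of_sample_r2) pair per fold."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    results = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        b0, b1 = fit_line([xs[i] for i in train], [ys[i] for i in train])
        obs = [ys[i] for i in held_out]
        preds = [b0 + b1 * xs[i] for i in held_out]
        obs_mean = mean(obs)
        sse = sum((o - p) ** 2 for o, p in zip(obs, preds))
        sst = sum((o - obs_mean) ** 2 for o in obs)
        results.append((variance(preds) / variance(obs), 1 - sse / sst))
    return results


# Simulated data (NOT the project's data): 45 points, 3 per fold.
rng = random.Random(1)
xs = [rng.uniform(10, 150) for _ in range(45)]
ys = [-500 + 14 * x + rng.gauss(0, 1000) for x in xs]

results = k_fold_r2(xs, ys, k=15)
```

With only a few observations per fold, both quantities are noisy, which is one reason fold-level values can vary as widely as those printed above.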

Plot the results

The histogram displays the distribution of R² values across the 15 folds used in cross-validation. The average R² is approximately 1.492, indicated by the red dashed line. Because a proportion of explained variance cannot exceed 1, this average should not be read as the model explaining more than all of the variance in the validation data. Values above 1 can arise when the R² for a fold is computed as the variance of the predictions divided by the variance of the observed values: in a small held-out fold, the predictions can vary more than the observations do. The average is also dominated by a single fold with a value of 15.11; the median across folds is about 0.26, close to the in-sample R² of 0.248.

Twelve of the 15 values fall below 1, and eleven lie between roughly 0.04 and 0.57, indicating that the model typically explains a small-to-moderate share of the variability in health spending. However, there is considerable variation between folds (three values exceed 1), reflecting differences in how well the model generalizes across different subsets of countries.

This spread suggests that the model has some predictive power, but that its performance is not consistent, possibly due to regional or structural differences in the relationship between sugar consumption and healthcare spending. Since the median fold R² is close to the in-sample value, there is no clear evidence of overfitting. An average above 0 is weak evidence on its own, however: if the fold values were computed as a variance ratio, as the values above 1 suggest, they can never be negative, so they cannot show that the model predicts better than the mean alone.
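The 15 fold values printed earlier can be summarized directly to see how much a single fold drives the average. The values are copied from the output above; the variable names are illustrative.

```python
from statistics import mean, median

fold_r2 = [
    0.08901934, 2.08033802, 1.50571501, 0.22974073, 0.03571885,
    0.57156824, 0.26344297, 0.18883690, 0.95914753, 0.24124309,
    15.11158664, 0.50671016, 0.30705568, 0.18275002, 0.11039892,
]

mean_r2 = mean(fold_r2)
median_r2 = median(fold_r2)
mean_without_max = mean(sorted(fold_r2)[:-1])  # drop the largest fold
n_above_one = sum(value > 1 for value in fold_r2)
```

The mean is about 1.49, but it falls to about 0.52 once the largest fold is dropped, and the median is about 0.26. The median is therefore a more robust summary of these fold values than the mean.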

References

Berthold, J. (2023, April 21). Sugary drink tax improves health, lowers health care costs: Sweetened beverage purchases drop 27% in Oakland, signaling potential impact of national legislation. University of California, Berkeley. https://publichealth.berkeley.edu/articles/spotlight/research/sugary-drink-tax-improves-health

Food and Agriculture Organization. (2024). FAOSTAT: Sugar & sweeteners food supply data. http://data.un.org/Data.aspx?q=Sugar&d=FAO&f=itemCode:2909

World Health Organization. (2024). Global Health Expenditure Database. https://apps.who.int/nha/database